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Abstract 


Currently, our analytical competences are struggling to keep-up the pace of in-deep 
analysis of all generated large-scale data resultant of high-throughput omics platforms. 
While, a substantial effort was spent on methods enhancement regarding technical 
aspects across many detection omics platforms, the development of integrative down- 
stream approaches is still challenging. Systems biology has an immense applicability 
in the biomedical and pharmacological areas since the main goal of those focuses in 
the translation of measured outputs into potential markers of a Human ailment and/ 
or to provide new compound leads for drug discovery. This approach would become 
more straightforward and realistic to use in standard analysis workflows if the collation 
of all available information of every component of a biological system was ensured 
into a single database framework, instead of search and fetch a single component at 
time across a scatter of databases resources. Here, we will describe several database 
resources, standalone and web-based tools applied in disease analytics workflows based 
in data-driven integration of outputs of multi-omic detection platforms. 


Keywords: systems medicine, bioinformatics, omics, data integration, 
pathway analysis 


1. Introduction 


Over the last decade the emergence of high-throughput screening platforms and 
the increase in availability of large-scale-omics data, as well as clinical data from 
electronic health records comprising phenotypic, therapeutic and environmental 
factors information opened the possibility to mechanistically understand diseases 
and diseases stages at the molecular level. Thereby, a great number of wealth data 
in many kidney and cardiovascular conditions was generated, however these find- 
ings were neither translated nor reached the clinical setting and are still enclosed in 
peer-reviewed literature and across general scope expression profiling databases. 
Simultaneously it has become apparent that the existing systems to integrate and 
correlate this data are either inadequate or non-existent. Due to the multi-factorial 
molecular phenotype of disease, it is evident that development of novel therapeutic 
and disease detection approaches should be based upon the study of the entire 
“System” simultaneously. Figure 1 gives a general overview in the fundamental 
difference between conventional and systems approaches, whereby in the context 
of conventional approaches a hypothesis is put forward that is assumed to be of 
importance in the disease or biological condition. This hypothesis is then tested 
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Figure 1. 

Overview of general differences between conventional and Systems Biology approaches in biological and disease 
analysis research. Red arrows show the path of the conventional hypothesis-driven methodology including 
testing of a hypothesis, usually employing lab-based investigations, and re-adjusting the hypothesis dependent 
on outcomes. Blue arrows denote a systems approach, where data are integrated and analysed, producing a 
model system and a hypothesis that can be verified using conventional methods. Outputs of such an approach 
are usually fed back into the model or the data analysis stream to refine models, adjust hypothesis or confirm 
the established model. 


and either validated or refuted based on the outcome of this hypothesis-driven 
methodology. Yet, it is obvious that it is easy to investigate any hypothesis and then 
choose the one that appears most correct, in the real world constraints such as time 
and financial resources do not allow for such an approach, and hypotheses are usu- 
ally generated on a best-guess basis which can lead to a substantial amount of bias, 
resulting in skewed or partial insights and can often be misleading. In order to avoid 
such scenarios, research driven by the data itself rather than a hypothesis has been 
proposed a long time ago, but could not be properly implemented due to the lack of 
unbiased large-scale data or the ability to integrate disparate data in the first place. 
Additionally, a successful systems approach requires underlying prior knowledge, 
such as physicochemical parameters in how molecules interact with each other, what 
reactions they are involved in and other unconnected information. This knowledge 
has only slowly been accumulated through conventional research and has only 

over the last 10-15 years been available to such an extent where a systems approach 
became feasible. Data-driven systems biology-based diagnostic and prognostic 
models consisting of relevant panels of molecules—key branches of the cellular 
network, appear to more accurately reflect pathophysiology than traditional hypoth- 
esis-driven approaches, consequently, may have a much higher chance of success and 
implementation in the clinical setting. Of the most pronounced effects is the crossing 
between research borders and the urge for multidisciplinary integration of biology, 
chemistry, computing sciences, mathematics, and medicine to tackle the complexity 
of such system. To get a holistic view of a system’s biology, multiple and different 
types of observations must be combined, such as clinical which includes pathologi- 
cal, demographical, epidemiological, and as well as molecular, which includes 
large-scale genotyping, gene expression, proteomics, metabolomics, and lipidomics 
data. The downside of such an approach in disease analytics or data integration is 
the rise in complexity both in output as well as in methods needed to generate those, 
and the skills required to interpret and contextualise outcome parameters. However, 
biological and disease models generated this way allow for a higher confidence in 
generating testable hypotheses, disease classifications on a molecular level and 
identification of overlapping and divergent pathways of malignant conditions. 
Ultimately, the removal of bias and integration of all available data, both clinical and 
biological, leads to a far better understanding of disease and enables the identifica- 
tion of intervention points with higher confidence and accuracy. 
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2. Disease classification boundaries 


The standard resource for disease taxonomy relies primarily on the International 
Classification of Diseases (ICD) which displays information on diseases and health 
conditions, and a continuous monitoring of the associated epidemiological sta- 
tistical trends World Health Organisation [1]. The foundations of the ICD disease 
classification relies mainly in a type of evidence-based medicine with distinction 
of clinical features, including patient symptoms, histological assessment, and 
evaluation of risk factors [2]. While widely used in the clinical setting, in the era of 
“big-data” and precision medicine, its rigid hierarchical structure lacks the flexibil- 
ity needed to accommodate the fast and expanding molecular-insights of disease- 
phenotypes captured across many -omics platforms [3]. Moreover, to support this 
notion of undefined disease boundaries across current disease classification, we can 
observe the existence of co-occurring conditions that if seen as a unified biological 
network, could provide information about common multi-functional genes, cel- 
lular pathways, as well the impact of lifestyle [4]. Additionally, analysis of disease 
progression with the presence of overlapping conditions through evaluation of 
temporal correlation and disease progression patterns condensed from a population 
can become useful in the prediction and prevention at the patient’s individual level 
in future disease-associate events [5]. 

Further disease taxonomy refinement can be achieved by applying network 
analysis [3] of combined disease phenotypes sourced from ICD-9 with protein- 
protein interactions (PPI’s) data from STRING [6] and additional curation efforts 
of gene-disease associations (GDA) from several data sources. The network analysis 
allowed for reclassified of pancreatic cancer into 11 subclasses, which is consistent 
with the number of molecular subtypes observed in the Bailey et al. [7] study. They 
also proposed the use of such approach in drug repurposing, for instance therapy 
with metformin, a well-known agent used to treat type 2 diabetes mellitus (T2DM), 
that could regulate the imbalanced status of the microbiota community in the gut 
mucosae, a known cause of pathological chronic bowel inflammation as occurs 
in Crohn's disease and ulcerative colitis [8], and also act as preventable agent to 
reduce the risk of colorectal cancer. Moreover, molecular profiling associated with 
histologic assessment seems to yield enhanced probabilistic scores in graft survival 
predictions. For instance, joint integration of multi-center histology features in 
renal biopsies and gene-array data yielded a new molecular score system able to 
predict renal graft survival [9] and improving the diagnosis of antibody-mediated 
rejection of transplanted in hearts [10]. Such approaches can also be implemented 
to assess disease trajectory, treatment selection and monitoring in many neoplasms, 
and could be specially tailored for cases where the tumour primary site is of 
unknown origin [11]. 


3. Systems biology towards systems medicine 


Over the last 15 years, the rise of systems biology as a research field has changed 
how we look at human normal physiological function and has helped to uncover 
disease complexity. Now scientists use systems biology approaches to understand 
the big picture of how all the pieces interact in an organism. The inference of 
genotype-phenotype relationships boosted by the assembly of a high-quality 
human genome opened the avenue for the development of reference maps of 
interactome networks, [12] consisting of binary association pairs, for instance PPI’s, 
protein-DNA/RNA, or protein-metabolite interactions. Figure 2 shows the essential 
biological molecular interactions governing cell behaviour in an over-simplified 
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Figure 2. 

Description of the essential known relationships/interactions in an over-simplified biological system. 
Transcription factor (TF), microRNA (miRNA), post-translational modifications (PTMs). The illustration 
does not account for epigenetic modifications, for instance DNA methylation and histone modifications known 
to occur and regulate gene expression. Dark coloured arrows denote entity associations, while self-circular 
arvows describe self-pair interactions or modifications. 


biological system. A curated compilation of high-quality sources of binary inter- 
actions is considered a prime resource in the Systems Biology field and thereby 
enabling a deeper understanding of the larger picture—be it at the level of the 
organism, organ, tissue, or cell—by putting its components together. It’s in stark 
contrast to decades of reductionist biology, which merely focuses on the properties 
of its individual components [13]. Most disease conditions exhibit expression of 
complex disease phenotypes [13], such as obesity, metabolic syndrome, autoim- 
mune diseases and renal diseases. 

Using the words of Ronald Germain to provide a definition of Systems Biology, 
he advocates that: “There are an endless number of definitions, it’s even worse than 
the elephant,” that infamous elephant that stymies the attempts of blind men to 
describe it because each feels just one part, “Some people think of it as bioinfor- 
matics, taking an enormous amount of information and processing it.” “The other 
school of thought thinks of it as computational biology, computing on how the sys- 
tems work. You need both parts.” Ironically, to best understand this novel approach, 
we should take a reductionist approach to defining its parts. The system, it seems, is 
more than the sum of its parts [14]. Systems Biology requires comprehensive data at 
all molecular levels, a profound understanding of biological systems, data-criteria 
based assessment and in-deep understanding of the limitations of the techniques 
used in the experimental setup. Moreover, systems biology requires prior knowl- 
edge either published or sourced from biological databases and newly predicted 
and frequent molecular events requires further in vivo/vitro validation [15]. Systems 
Biology is cross-disciplinary: “[...] a scientific approach that combines the principles 
of engineering, mathematics, physics, and computer science with extensive experi- 
mental data to develop a quantitative as well as a deep conceptual understanding of 
biological phenomena, permitting prediction and accurate simulation of complex 
(emergent) biological behaviours” (Ronald Germain in [14]). Furthermore, sys- 
tems biology promotes understanding of the functional roles and interplays of all 
molecules in cells in health and disease. Also provides a framework for large-scale 
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data-driven analysis and predictions based on prior knowledge of experimentally 
identified interactions and pathways [16]. Thus, more relevant that the underlying 
high-throughput screening methods, including genomics, proteomics, metabolo- 
mics, and also bioinformatics approaches is the use of such methods in a integrative 
manner to holistically understand how nonlinear processes and their outcomes are 
regulated in a biological system [17]. 


3.1 Bridging the gap between fields 


Over the last 10 years, major efforts to reclassify diseases based on molecular 
insights from advances in molecular biology, bioinformatics and high-throughput 
screening yielded novel disease subtypes among many disease conditions. The 
use of multiple data types, including clinical endpoints—omics and ontology- 
based data have been used to reconstitute disease phenotypes, classify and to 
refine disease-relationships [18]. Nevertheless, the development of a molecular- 
based disease taxonomy that links global molecular networks with pathological 
phenotype landscapes remains elusive. Systems medicine can be perceived as a 
multi-disciplinary collaborative effort driven by the application of systems biology 
approaches, which includes methodological workflows from high-throughput- 
omics technologies to generate data, warehousing management systems for data 
flow and handling and methods for data analytics and interpretation in the context 
of biomedical research [19]. Ultimately, with further adoption of a systems-based 
approach patients will benefit of a measurable improvement of their health status 
since processes of disease onset and progression will be mechanistically identified, 
leading to new insights regarding disease-disease boundaries, and disease subtyp- 
ing which facilitates ideal pharmacological interventions as drug repurposing [20]. 
For instance, the identification of digoxin, a drug used as therapy for atrial fibrilla- 
tion and congestive heart failure [21] as potential drug candidate for pharmacologi- 
cal intervention in medulloblastoma subtypes 3 and 4 [22]. The authors of the study 
implemented an integrative systems biology approach using genomic data and 
collating existing drug-drug, drug-targets interactions information into a tridi- 
mensional functional-drug network. This approach involved handling omic data 
sets such as DNA-seq—mutated genes, copy-number variation (CNV)—repeated 
sections of the genome, RNA-seq and methylation profiles, combined with clinical 
measurements of patient outcomes (survival data) and fused using network-based 
and probabilistic methods that yielded a network composite with disrupted driver 
signalling networks and potential drug candidates [22]. 


4. Large-scale data: omics platforms 


The advent of new high-throughput technologies (sequencing, array-based and 
mass spectrometry) led to an explosion of available data, not only by the number of 
experiments performed, but also by the data density obtained per experiment. Here, 
we will provide description of detection platforms handling molecular datasets; for 
medical imaging data types and analysis strategies please see the following review [23]. 


4.1 DNA microarrays and next-generation sequencing (NGS) 


Microarray technologies have been widely used in research for primary screening, 
including gene expression profiling and providing genotype-phenotype relation- 
ship. Moreover, if properly designed, microarrays will not only provide information 
on gene expression and expressed single nucleotide polymorphisms (SNPs), but 
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also detect exon junctions and fusion genes [24]. However, identical to PCR-based 
techniques, the design of probes requires prior knowledge. Therefore, microarrays 
are mostly applied in the quantification of known sequences and not for the discov- 
ery of new variants, transcripts or other unknown features [25]. Microarrays have 
numerous limitations. For instance, they render an indirect measurement of the 
relative concentration of a particular nucleic acid sequence [26]. Another limitation 
is based that a DNA-array can only detect sequences that the array was designed for. 
In addition, non-coding RNAS that are not yet recognised as expressed are typically 
not represented on an array [26]. Microarrays are still considered a reliable technique 
for routine and/or initial screening that allows multiplex quantitation of microRNAs 
and gene probes expression in a fast, simple and affordable way. Nevertheless, the 
continuous drop in the cost of NGS at a level that virtually matches the cost of DNA 
microarray-based platforms, thus is foreseen that DNA-arrays will be fully replaced 
by sequencing methods within the next decade [26]. 


4.2 Proteomics 


The use of omics technologies, including quantitative proteomics methods aims 
to identify and quantify the dynamics of protein abundance, in order to gaina 
deeper understanding of the associated biological functions. Thereby, the quantifi- 
cation of the expression level and state of all proteins at a given time can characterise 
physiological-states at the cellular-level [27]. Mass spectrometry (MS) technology, 
particularly tandem mass spectrometry (MS/MS), has been utilised as a discovery 
engine in proteomics [28]. This technology allows for identification and simultane- 
ously quantification of hundreds or even thousands of proteins in an experimental 
setup, which enables real-time comparisons for instance between two or more physi- 
ological states [29]. Furthermore, peptide sequence composition will directly impact 
on ionisation efficiency, and their intensities observed in a spectrum often do not 
reflect their abundances, [30] thereby many label-free or label-based quantitation 
methods have arisen to allow comparative proteomic analysis. For instance, label- 
free proteomic approaches such as ion intensity, spectral counting have a simplified 
workflow when compared to labelling techniques; have no theoretical limit concern- 
ing multiplexing capability providing an improved proteome coverage, but lower 
quantification accuracy when compared with labelling methods (e.g. iTRAQ;: iso- 
baric tags for relative and absolute quantitation, SILAC: stable isotope labelling by/ 
with amino acids in cell culture) [30]. In proteomics, several algorithms have been 
developed to query and cross compare MS data. The most popular used to identify 
proteins from raw MS data are for instance, MASCOT, SeQuest, OMSSA, X!Tandem 
[31], Andromeda [32], MS-GF [33], Paragon [34] and more recently, Morpheus 
[35] and an improved SEQUEST-like algorithm—ProLuCID [36]. The rise in the 
number of algorithms and specialised computational tools for analysis of MS-based 
proteomics data sets led to the development of workflows/pipelines such as PEAKS 
[37], MaxQuant [38], OpenMS Proteomics Pipeline (TOPP) [39], Trans-Proteomic 
Pipeline (TPP) [40] and others for further downstream data analysis—Perseus [41]. 


4.3 Metabolomics 


In many metabolomics studies the identification and quantification of metabo- 
lites mainly rely on the application of analytical methods based on mass spec- 
trometry (MS) (either coupled with a liquid or gas-chromatograph) and nuclear 
magnetic resonance (NMR) spectroscopy [42]. Metabolites are defined as small 
molecules, usually less than 1000 Da, which suffer several changes during cellular 
metabolism [43]. The selection of a particular platform depends upon the aims of 
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the experimental study and is typically driven by establishing a compromise among 
sensitivity, specificity, and scanning speed [44]. Metabolomics approaches can be 
globally split either by the full range measurement/analysis of all compounds in 

a given sample—untargeted metabolomics, or targeted metabolomics, in whicha 
set of predefined and biochemically well-characterised compounds are measured 
in a sample [44]. MS has become an essential method for non-targeted profiling 

of metabolites in complex bio-samples, particularly low-abundance metabolites, 
due to its high sensitivity and selectivity capabilities when using liquid chroma- 
tography (LC) coupled to tandem MS/MS [45]. Metabolomics data from NMR and 
MS platforms are complex because they usually contain thousands distinct peaks 
therefore, multivariate statistical analysis plays an important role in metabolomics 
for reducing data dimensions, differentiating similar spectra, and in the develop- 
ment of predictive models [46]. Metabolomics is used as a screening tool in current 
healthcare settings, and could be greatly utilised to monitor therapy efficacy, and 
assess potential drug side-effects [47]. 


5. Data-driven approaches and multi-omics data integration 


In the field of biomedical research adopting an unbiased approach or 
“hypothesis-free” (depending of the author and field of study, also defined as 
hypothesis-generating approach, data-driven research, or discovery research) to 
research can bring several benefits when compared with the widely used scientific 
approach—hypothesis-driven research (traditional approach). In which, the latter, 
in some cases encourages poor scientific practices by forcing/imposing qualita- 
tive and weak hypotheses that Are not prepared for strong statistical inference or 
quantitative analysis (QA) modelling, thereby in such cases an explicitly explor- 
atory approach should be set as default [48]. In order to overcome this problem, 
large-scale approaches such as expression profiling started to become very popular 
in the mid ‘90s, and beginning of 2000, with the advent, rapid development and 
availability of high-throughput mass spectrometry, other methods followed [49]. 
Computational methods to analyse this flood of data were developed accordingly, 
however the majority only focused on one specific technology or experimental 
setup and up to this day are very often not interchangeable in other technological 
platforms. Large-scale approaches employed in omics research need a different 
analysis methodology, which is especially true if integrative analysis techniques 
are employed. True integrative (as opposed to integrating linear relationship data 
such as gene-protein data) approaches go beyond simple data fusion and gave rise 
to the field of Systems Biology. On the other hand, hypothesis-generating research 
(systems biology-derived hypotheses) and hypothesis-driven research are comple- 
mentary, thus combining both approaches will certainly sustain more chances of a 
complete understanding of complex biological systems, than either approach on its 
own [48]. With the advent of high-throughput technologies their application in the 
biomedical field was a foreseen logical step. However, until recently integration of 
multi-omic data was not a common approach in former analysis workflows. The lit- 
erature and publicly available databases are awash with data, yet the main approach 
of integrating all this information in a disease-specific context is traditionally based 
on meta-analysis at best or cannot be accomplished using standard computational 
methods. This molecular information can then be integrated in a further stage by 
means of meta-analysis or by cross-normalisation of data from different acquisition 
platforms [50]. A combinatorial stepwise data integration (Figure 3) approach can 
be used in order to incorporate data from different biological layers of informa- 
tion to predict phenotypic outcomes [51]. On the other side, by recreating the cell 
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Purposed workflow for a data-driven approach. Data generation from omics platforms plus existing biological 
information (a), development of a multi-omics database (b), selection of suitable modelling methods (c), 
model validation and use for hypothesis-generating research (d), lead optimization and candidate selection (e). 


environment and dynamics by describing their interactions on a qualitative and 
quantitative manner and relying on underlying data (prior biological knowledge) 
for connectivity, e.g. PPI’s, molecular co-occurrence, ontologies and enzymatic 
reactions [52]. Large-scale data sets for instance derived from multi-omics plat- 
forms may also be used to infer novel relationships by network learning approaches 
using Bayesian inference models [51] and extracting molecular information from 
multi-layered networks. This approach (as in many others) is challenging since it 
requires enough statistical power, higher number of samples to deduce all the pos- 
sible interactions. Another challenge is due to the lack of uniformisation regarding 
the ‘gold ‘standards (criterions for evaluation) for accepting or rejecting relation- 
ships of the inferred model; however the ability to recreate a well-accepted interac- 
tion can at least be used for benchmarking methods in biological systems [53]. 


6. Biological databases and database systems 


Databases form the basis for most applications in bioinformatics. The number of 
biological databases available now is enormous, the journal of Nucleic Acids Research 
(NAR) catalogues a total of 1737 molecular biology databases (2018 edition) [54]. The 
2018 edition contains an enormous set of 181 papers that describe the adding of 82 
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new biological databases, 84 updates and as well 15 databases published elsewhere. 
However, a prominent issue concerns that many databases are not maintained over 
time and abandoned, yet they persist in database listings. There are many different 
types of databases, ranging from primary databases containing sequence data such as 
nucleic acid or protein; secondary databases or also known as pattern databases hosts, 
that results from the analysis of the sequences held in primary databases. 


6.1 General scope expression databases 


The Gene Expression Omnibus (GEO) [55] is a public repository that functions as 
both warehouse of raw microarray and other gene-based high-throughput data, and 
additionally serves as a platform for gene differential expression (DE) analysis using 
the GEO2R tool across a multitude of experimental conditions of user-submitted 
pre-processed data sets. In the same way, the European counterpart for storing of 
high-throughput genomics exists such as the European Bioinformatics Institute 
(EMBL-EBI) throughout the ArrayExpress database [56]. These data resources are 
both in compliance with community guidelines for description of an experimental 
setup for microarray and high-throughput NGS experiment. Comparatively, there 
is currently much less support for sharing of proteomics and metabolomics data sets 
despite the increasing demand. Public efforts for proteomic data sharing yielded the 
Proteomics Identification Database (PRIDE) that contains over 10,100 user-sub- 
mitted MS-based raw proteomic data sets (September 2018) [57]. PeptideAtlas [58] 
handles re-analysed data sets via the TPP pipeline to provide end-users a consistently 
view over their data. MetaboLights [59] hosts user-submitted metabolomics experi- 
ments, which currently houses 439 experiments (November 2018). The standards 
for reporting proteomics and metabolomics experiments are coordinated by the 
Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI), and 
Metabolomics Standards Initiative (MSI) respectively. 


6.2 Disease profiling databases 


Our group developed more specialised databases resources in several disease 
conditions handling pre-selected data sets containing DE molecules. In nephrol- 
ogy, we developed the Chronic Kidney Disease database (CKDdb) [60] storing 
microRNA, genomics, peptidomics, proteomics and metabolomics information 
relevant to CKD, collected from over 300 studies in the literature and integrated 
into the Pan-omics Analysis DataBase (PADB). The PADB framework (www.padb. 
org) uses gene and protein clusters (CluSO) and mapping of orthologous genes 
(OMAP) between species therefore facilitating data harmonisation from a diverse 
range of omics platforms and across several species, which makes it an invaluable 
resource for systems biology data-driven approaches. Also, many conditions associ- 
ated with the cardiovascular system are covered in the Cardio/ Vascular Disease (C/ 
VD) database [61], which gives special emphasis on coronary artery disease (CAD). 
In neurological associated conditions such as Multiple Sclerosis we also developed 
the MuScle database [62] that stores and integrates curated data sets mined from 
large-scale studies with focus on genomics and miRNA. Likewise, we built a 
cancer-related differential expression database: the Multi-Omics Cancer database 
(MoCadb) that integrates clustered molecular information covering multi-omics 
studies in many gastro-intestinal cancers. In the same framework we also cover an 
assorted disease profiling database valuable for subtractive disease analysis studies, 
the Large-Scale Screening Resource (LSSR) that contains 81,980 entries, referring 
to 13,589 molecules. Moreover, a peak profiling database for biomarker patterns 
research, the Urinary Peptidomics and Peak-maps (UPdb) [63] database that 
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comprises Human urinary fingerprints from 200 subjects analysed mainly through 
surface enhanced laser desorption ionisation-time of flight mass spectrometry 
(SELDI-TOF-MS). 


7. Software tools and solutions 


Many modern high-throughput technologies lead to the generation of excep- 
tionally large-scale and complex datasets, which includes PPI’s, protein-DNA 
interactions, kinase-substrate interactions, qualitative and quantitative genetic- 
interactions gene co-expression [64]. The “Big Data” challenge can be fulfilled by 
the development of Bioinformatics tools to handle these large-datasets to reduce 
their complexity to a level that enables rationale interpretation and in this way is 
more likely to provide new biological insights to the Life Sciences. The compilation 
(not an exhaustive list) of many web-based, standalone tools and R-based packages 
are described in Table 1. They allow the accomplishment of different-omics tasks 


Name Description Webpage Ref. 

iClusterPlus Integrative clustering  bioconductor.org/packages/iClusterPlus [84] 

mixomics Data integration mixomics.org [85] 
(CCA,PLS,PCA) 

omicade4 MCIA and CIA bioconductor.org/packages/omicade4 [86] 

pwOmics Pathway-based bioconductor.org/packages/pwOmics [87] 
integration of omics 

PRESTO Dimensionality github.com/saramcardle/PRESTO [88] 


reduction of 
multivariate data 


caret Classification and cran.r-project.org/web/packages/caret — 
regression training 


GEO2R Identify DE genes ncbi.nlm.nih.gov/geo/geo2r [55] 
using GEOquery & 
limma R packages 


Metabo Analyst Metabolomics metaboanalyst.ca [89] 
analysis 


Networkanalyst/INMEX Integration of gene networkanalyst.ca [90] 
DE via network 
approaches 


ExAtlas Meta-analysis & lgsun.irp.nia.nih.gov/exatlas [91] 
visualisation of gene 
DE 


Elastic net Gene DE with fitted https://zenodo.org/record/16006 [92] 
GLM 


ATHENA Integration of ritchielab.org/software/athena-downloads [93] 
genomics with 
clinical data 


Network propagation Gene DE, mutations, http://apps.cytoscape.org/apps/Diffusion [94] 
PPI’s 


PMA, Penalised Multivariate Analysis; RGCCA, Regularised and Sparse Generalised Canonical Correlation 
Analysis for Multiblock Data; caret, Classification and REgression Training; ATHENA, Analysis Tool for Heritable 
and Environmental Network Associations; CCA, Canonical-Correlation Analysis; PLS, Partial Least Squares; PCA, 
Principal Component Analysis; CIA, Co-Inertia Analysis; MCIA, Multiple Co-Inertia Analysis; GO, gene ontology; 
DE, differential expression; GLM, generalised linear models. 


Table 1. 
Web-based, standalone tools and R packages dedicated to different-omics tasks such as feature selection, sample 
classification, multivariate approaches in data integration and meta-analysis. 


10 


Integrative Systems Biology Resources and Approaches in Disease Analytics 
DOI: http://dx.doi.org/10.5772/intechopen.84834 


such as feature selection, sample classification, multivariate methods. Cytoscape 
[65] is a tool primarily designed for network visualisation and analysis and has 
useful plugins available through the hosting website. Cytoscape makes use of a 
wide wealth variety of plugins to extend its functionality which are designed by 
the scientific community. The platform counts with several freely available apps/ 
plugins (over 300 apps available on November 2018) for a diverse array of uses and 
analysis types. 


7.1 Gene ontology (GO) and pathway-term-enrichment 


The Gene Ontology (GO) consortium [66] aims to capture the increasing 
knowledge on gene function in a controlled vocabulary applicable to a wide range 
of organisms. GO represents genes and gene products attributes on matters of their 
associated biological processes (BP), cellular components (CC) and molecular 
functions (MF). GO is considered roughly hierarchical, with ‘child’ elements 
(terms) being more specific than their ‘parent’ elements (terms), nevertheless, a 
‘child’ element (term) might have more than one parent element. The ClueGO app 
[67] is used for the integration and visualisation of GO and pathway terms sourced 
from KEGG [68], WikiPathways [69] and Reactome [70]. The resultant ClueGO 
network is established based in kappa statistics which shows the agreement on 
how any given gene and/or gene products pairs share similar terms. The ClueGO 
analysis output is conditioned by thresholding of the kappa coefficient, in which a 
higher coefficient conducts only to the visualisation of close-related terms with very 
identical gene products. While, lower kappa coefficients will let visualisation of less 
associated terms. 


7.2 Gene-disease associations (GDA) 


The conclusion of the Human Genome Project led to the massification of 
research related with uncovering genotype—disease phenotype associations [71]. 
This event translated in a disparate growth in the number of publications and on 
the other side a limited and slow paced biocuration of these newly discovered 
evidences. Currently, DisGeNET [72] unifies biomedical literature evidence based 
on GDA collated from a multitude of databases. This database makes use of the 
Medical Subjects Headings (MeSH) tree structure for disease classification by a 
Unified Medical Language System. The potential of the database is extended by 
disgenet2r package and optional programmatic access. 


7.3 Protein-protein interactions (PPIs) 


STRING database [6] collates molecular information to cover both known and 
predicted PPI’s. All molecular interaction data is originally from primary interaction 
databases such as IntAct [73], BioGRID [74] and additional text-mining, coexpres- 
sion and high-throughput experiments and computationally predicted PPIs. The 
up-to-date database version 10.5 comprises nearly 26 million PPI with a confidence 
score greater than 0.9 of more than 9 million proteins across 2031 organisms. 
GeneMANIA is another source for PPIs analysis and is accessible via web interface 
[75], and also as a Cytoscape app that can be used to detect related genes of a input 
query by means of a “guilt-by-association” strategy, which explores the realisation 
that a protein function can be obtained from another by seeing whether it inter- 
acts with another of known function. The app uses a large database of functional 
interaction networks, indexing 2152 association networks containing more than 
500 million interactions mapped to 166,084 genes from nine organisms. 
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7.4 Combining metabolomic and gene expression data 


Multi-omics datasets might not only contain protein and gene data, but also 
expression profiles of chemical compounds. While it is easy and straightforward 
to combine protein/DNA/RNA expression data using common identifiers, this is 
not the case for metabolism end-products—metabolites. This requires a guilt-by- 
association, which explores the rationale that metabolites are frequently produced 
by enzymes and a shift in metabolite expression can reflect an up-stream shift in 
protein or gene expression. This involves semantic searches in enzyme reposito- 
ries—BRENDA to identify potential proteins and has some inherent pitfalls such 
as uncertainty which enzyme/isoform is responsible for the metabolic change. 
Additionally, the same compound could also be generated by several proteins, 
which adds to the uncertainty. Therefore, metabolic datasets are often treated 
as separate entities in multi-omics studies and analysed independently and then 
converged only at the level of final outcomes [76]. The MetScape 3 app [77] for the 
Cytoscape can perform joint analysis of both metabolomic and gene expression data 
and allows visualisation of the entire fused network, or by selecting custom views 
based on metabolic pathways When dealing with large-scale datasets, there is the 
option to use a concept file based on pre-computed gene set enrichment analysis 
(GSEA), along with statistical and fold-change thresholds. 


7.5 Transcription factor (TF )-driven modules and microRNA-target regulation 


Transcription factors (TF) are critical for the regulation of gene expression since 
they control if gene’s DNA is transcribed into RNA [78]. A compendium on non- 
redundant TF and TF binding sites can be found at JASPAR [79]. The number of 
human TF ranges from 1500 to 2600, depending on source and stringency [78]. Direct 
analysis of modulated events due to TFs is not only valuable but might shed light on 
hidden elements that conventional pathway analysis cannot reveal. However, many 
TF binding sites and modulated genes are very hypothetical and often a random guess. 
Therefore, network-based analysis and interpretation involving TF elements should be 
taken with caution. CyTargetLinker [80] for extends existing biological networks by 
adding interactions associated with regulatory elements such as TF-target, miRNA-tar- 
get or drug-targets. The application requires a loaded network with network attributes 
preferentially mapped to Ensembl, NCBI gene, UniProt, miRBase or DrugBank. 
Similarly, in CluePedia [81] users can perform miRNA analysis, by matching it to 
target-genes via selection of different database resources custom versions. Users can 
upload a list of genes and query the app to perform gene/miRNA enrichments. Then it 
will generate a miRNA-target interaction network that can be reused for inline integra- 
tion with GO and pathway term clustering [81] within ClueGO. 


7.6 Pathway mapping and visualisation 
7.6.1 PathVisio pathway mapping and edition 


PathVisio [82] allows drawing, edition, and visualisation of pathways han- 
dling gene, protein and metabolite data that can be further cross-mapped via the 
BridgeDb [83]. Inference of relevant pathways is based on an archive of pre-existent 
pathway maps from WikiPathways [69] and Reactome [70], establishing pathway 
over-representation based on a Z-score statistical procedure under the hyper- 
geometric distribution and a P-value ranking based on a permutation procedure 
(randomisation test) that compares actual and permuted Z-scores. Pathways with a 
permuted P < 0.05 are considered significant by default. 
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7.6.2 KEGG pathway mapping 


KEGG is an integrated database resource of biological systems integrating 
genomic, compound and functional information. KEGG allows analysis of datasets 
from high-throughput omics technologies by uploading a list of genes/proteins or 
metabolites along with optional statistical scores and fold-change values. After con- 
verting to KEGG internal identifiers, the molecular data is matched (KEGG mapper) 
into a collection of curated pathways, covering metabolism, signalling transduction 
pathways, specific pathways for several disease conditions and drug development. 


8. Conclusions and future perspectives 


The availability of large-scale multi-omics data has opened the avenue to gain 
an unrivalled insight in disease-associated molecular pathophysiological changes. 
Simultaneously it has become apparent that systems to integrate and correlate this data 
are either inadequate or non-existent. The literature and publicly available databases are 
awash with data, yet the main approach of integrating all this information in a disease- 
specific context is traditionally based on meta-analysis at best or cannot be accomplished 
using standard computational methods. In order to better model complex organisms, 
samples from multiple tissues of the same individuals should be studied simultane- 
ously using omics data, which will require the development of novel analysis methods. 
Acquiring the relevant tissues and/or body fluid sources from Human study cohorts can 
of course be difficult, thereby comparative systems biology may help identify which 
organisms may be similar enough in each aspect to be used as models. It is sometimes 
suggested that omics technologies and systems biology have failed to deliver many 
breakthrough enhancements to the treatment of complex diseases. In some cases, it 
may be that in fact such diseases are not truly one disease from a system or reductionist 
point-of-view, but several with the same or similar phenotypic end-points—i.e., with the 
current terminology they are unknown subtypes of disease. If this is the case, then the 
overlap between the systems is poor and statistical methods which the approach relies on 
require very large cohorts for identification of these subtypes and subsequent description 
of each system. Other possibilities are that longitudinal data or samples from different 
tissues are required. Other relevant concerns arise from biomarker validation studies, 
such as correlated observations (i.e. multiple observations per patient), multiplicity 
(testing multiple biomarkers or endpoints), multiple clinical endpoints (interest in more 
than one relevant endpoint) and selection bias (from retrospective data or observational 
study). Data-driven investigations using systems biology approaches, although offer 
complete views over the function of biological systems in health and disease its limited 
by the state of completeness of prior biological information. 
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